AITopics | video action recognition

Storyboard-guided Alignment for Fine-grained Video Action Recognition

Neural Information Processing Systems, Jun-17-2026, 07:19:32 GMT

Fine-grained video action recognition can be formulated as a video-text matching problem. Previous approaches primarily rely on global video semantics to consolidate video embeddings, which often leads to misaligned video-text pairs because atomic-level actions are understood inaccurately. This inaccuracy arises because i) videos with distinct global semantics may share similar atomic actions or visual appearances, and ii) atomic actions can be momentary, gradual, or not directly aligned with the overarching video semantics. Inspired by storyboarding, where a script is segmented into individual shots, we propose a multi-granularity framework, SFAR. SFAR uses a large language model to generate fine-grained descriptions of the common atomic actions for each global semantic. Unlike existing works that refine global semantics with auxiliary video frames, SFAR introduces a filtering metric to ensure correspondence between the descriptions and the global semantics, eliminating the need for direct video involvement and thereby enabling more nuanced recognition of subtle actions. By leveraging both global semantics and fine-grained descriptions, SFAR effectively identifies prominent frames within videos, thereby improving the accuracy of embedding aggregation. Extensive experiments on various video action recognition datasets demonstrate the competitive performance of SFAR in supervised, few-shot, and zero-shot settings.
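
The idea of using text to pick out prominent frames before aggregating them can be illustrated with a small sketch. This is not SFAR's published implementation: the function names (`cosine`, `softmax`, `aggregate_frames`), the max-over-descriptions scoring rule, and the `temperature` value are all assumptions made for illustration, and the toy 2-D vectors stand in for real frame and text embeddings.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(scores, temperature=1.0):
    # Numerically stable softmax with a temperature.
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_frames(frames, text_embeddings, temperature=0.1):
    # Assumed scoring rule: a frame is as relevant as its best-matching text
    # (global label or fine-grained description).
    scores = [max(cosine(f, t) for t in text_embeddings) for f in frames]
    weights = softmax(scores, temperature)
    dim = len(frames[0])
    # Weighted average of the frame embeddings gives the video embedding.
    video = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    return video, weights

# Toy data: three frame embeddings and one text embedding.
frames = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
texts = [[1.0, 0.0]]
video, weights = aggregate_frames(frames, texts)
```

In this toy case the first frame matches the text best and therefore dominates the aggregated embedding, while the orthogonal second frame contributes almost nothing.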

large language model, machine learning, natural language, (17 more...)

Country: Asia > China (0.28)

Genre: Research Report > Experimental Study (1.00)

Industry:

Leisure & Entertainment > Sports (0.46)
Health & Medicine > Consumer Health (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)

cef53466b62aebbcf8aa2210a89b33a1-Paper-Datasets_and_Benchmarks.pdf

Neural Information Processing Systems, Apr-29-2026, 20:04:28 GMT

artificial intelligence, machine learning, occlusion, (20 more...)

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.46)

Industry:

Government (1.00)
Information Technology (0.93)
Law (0.67)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

cb3213ada48302953cb0f166464ab356-Paper.pdf

Neural Information Processing Systems, Feb-19-2026, 09:12:07 GMT

arxiv preprint arxiv, representation, transformer, (15 more...)

Genre: Research Report (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

Revealing the unseen: Benchmarking video action recognition under occlusion

Neural Information Processing Systems, Feb-17-2026, 05:03:20 GMT

In this work, we study the effect of occlusion on video action recognition.

artificial intelligence, machine learning, occlusion, (20 more...)

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.46)

Industry:

Government (1.00)
Information Technology (0.93)
Law (0.67)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Compressed Video Prompt Tuning (Bing Li, Jiaxin Chen)

Neural Information Processing Systems, Feb-12-2026, 21:14:02 GMT

Compressed videos offer a compelling alternative to raw videos, with the potential to significantly reduce online computational and storage costs.

artificial intelligence, machine learning, natural language, (14 more...)

Country:

Asia > China > Beijing > Beijing (0.04)
Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.04)
Asia > China > Guangxi Province > Nanning (0.04)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.93)

5bd529d5b07b647a8863cf71e98d651a-Paper.pdf

Neural Information Processing Systems, Feb-8-2026, 20:53:22 GMT

action recognition, normalization parameter, recognition, (14 more...)

Country: Asia > China (0.04)

Genre: Research Report (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

3776558654d8db1bfcb9ebde0e01184e-Supplemental-Conference.pdf

Neural Information Processing Systems, Feb-8-2026, 07:56:38 GMT

We thus add more parameters in the head network and see if this could close the gap. As UPerNet has an FPN-like head network, we add parameters by replacing FPN with BiFPN. From this figure, we can observe that the features across heads in the Transformer decoder are almost the same. Semantic Segmentation on ADE20K: For the semantic segmentation task, we adopt the widely used ADE20K [11] as the benchmark. Table 7: Hyperparameters for the frozen setting and full finetuning on Kinetics-400 video action recognition.

artificial intelligence, batch frozen fullft, frozen fullft, (12 more...)

Technology: Information Technology > Artificial Intelligence > Vision (0.52)

3776558654d8db1bfcb9ebde0e01184e-Paper-Conference.pdf

Neural Information Processing Systems, Feb-8-2026, 07:56:35 GMT

action recognition, arxiv preprint arxiv, recognition, (15 more...)

Country:

Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
Asia > China > Shaanxi Province > Xi'an (0.04)
Asia > China > Guangxi Province > Nanning (0.04)

Genre: Research Report (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

OmniVL: One Foundation Model for Image-Language and Video-Language Tasks

Neural Information Processing Systems, Feb-7-2026, 23:04:40 GMT

This paper presents OmniVL, a new foundation model to support both image-language and video-language tasks using one universal architecture.

machine learning, natural language, wang, (17 more...)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

CAST: Cross-Attention in Space and Time for Video Action Recognition

Neural Information Processing Systems, Dec-27-2025, 06:39:25 GMT

Recognizing human actions in videos requires spatial and temporal understanding. Most existing action recognition models lack a balanced spatio-temporal understanding of videos. In this work, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), that achieves a balanced spatio-temporal understanding of videos using only RGB input. Our proposed bottleneck cross-attention mechanism enables the spatial and temporal expert models to exchange information and make synergistic predictions, leading to improved performance. We validate the proposed method with extensive experiments on public benchmarks with different characteristics: EPIC-Kitchens-100, Something-Something-V2, and Kinetics-400. Our method consistently shows favorable performance across these datasets, while the performance of existing methods fluctuates depending on the dataset characteristics. The code is available at https://github.com/KHU-VLL/CAST.
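
The exchange of information between a spatial and a temporal stream can be illustrated with plain scaled dot-product cross-attention. This is a minimal sketch, not the CAST architecture: it has a single head, no learned projections and no bottleneck layers, and the names `cross_attention` and `exchange`, the residual addition, and the toy token values are assumptions made for illustration. The authors' actual code is at the repository linked in the abstract.

```python
import math

def cross_attention(queries, keys, values):
    # Each query attends over all keys and returns a weighted mix of the values.
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        attn = [e / z for e in exps]
        outputs.append([sum(a * v[j] for a, v in zip(attn, values))
                        for j in range(len(values[0]))])
    return outputs

def exchange(spatial, temporal):
    # Spatial tokens query the temporal stream, and vice versa.
    s_from_t = cross_attention(spatial, temporal, temporal)
    t_from_s = cross_attention(temporal, spatial, spatial)
    # Assumed residual connection: add the attended features to the originals.
    new_spatial = [[a + b for a, b in zip(s, u)] for s, u in zip(spatial, s_from_t)]
    new_temporal = [[a + b for a, b in zip(t, u)] for t, u in zip(temporal, t_from_s)]
    return new_spatial, new_temporal

# Toy tokens: two spatial tokens and three temporal tokens, both 2-D.
spatial = [[1.0, 0.0], [0.0, 1.0]]
temporal = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]
new_spatial, new_temporal = exchange(spatial, temporal)
```

Each stream keeps its own token count after the exchange; only the token contents change, since every token now carries a mixture of features from the other stream.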

name change, space and time, video action recognition, (1 more...)

Technology: Information Technology > Artificial Intelligence (0.45)
